The advent of high-throughput genome-scale technologies has enabled us to unravel a large number of previously
unknown transcriptionally active regions of the genome. Recent genome-wide studies have provided annotations of a
large repertoire of various classes of noncoding transcripts. Long noncoding RNAs (lncRNAs) form a major proportion of
these newly annotated noncoding transcripts and are presently known to be involved in a number of functionally distinct
biological processes. Over 18000 transcripts are presently annotated as lncRNAs, encompassing previously annotated
classes of noncoding transcripts including large intergenic noncoding RNAs, antisense RNAs and processed pseudogenes.
There is a significant gap in the resources providing stable annotation, cross-referencing and biologically relevant
information for lncRNAs. lncRNome has been envisioned with the aim of filling this gap by integrating annotations on a
wide variety of biologically significant information into a comprehensive knowledgebase. To the best of our knowledge,
lncRNome is one of the largest and most comprehensive resources for lncRNAs.

Database URL: http://genome.igib.res.in/lncRNome 



Introduction 

The availability of technology to annotate transcriptomes
at genome scale and single-nucleotide resolution has in
recent years provided a new outlook on the transcribed
regions of the human genome (1-3). Contrary to popular
belief, a large number of genomic loci are now annotated
as transcriptionally active (4). Many of these regions do
not have the potential to encode functional proteins and
thus constitute a class of transcripts popularly annotated
as noncoding RNA (5).



The noncoding RNA transcripts have been classified into a
number of subclasses. The most popular classification is
based on size and distinguishes small noncoding RNAs, which
include the well-annotated microRNAs (miRNAs) (6) and small
nucleolar RNAs (snoRNAs), from long noncoding RNAs (lncRNAs).

Long noncoding RNAs (lncRNAs), by definition, are transcripts
that are >200 nucleotides in length and do not have the
potential to encode proteins longer than 30 amino acids
(7, 8). Transcriptome annotation in recent years has
significantly expanded the repertoire of lncRNAs, not just
in humans but also in other model systems such as mouse (9)
and zebrafish (10, 11). Although noncoding transcripts longer
than 200 nucleotides have been grouped together under the
general classification of lncRNAs, the members of this class
differ significantly in their biological function, genomic
loci and regulation. The class includes previously known
classes of ncRNAs such as large intergenic noncoding RNAs,
transcribed pseudogenes and antisense transcripts, as well as
functionally distinct transcripts such as Xist, which is
involved in X inactivation (12), and Hotair (13), which is
involved in epigenetic regulation.

Functionally, the lncRNA class encompasses a wide variety of
distinct roles, such as X-chromosome inactivation, modulation
of chromatin structure, regulation of transcriptional and
posttranscriptional processes and epigenetic modifications
(14). The biological function of lncRNAs is modulated through
interaction with other biomolecules in the cell, such as DNA,
RNA and proteins (15). Recent evidence has also indicated
putative regulatory roles for smaller RNAs processed from
lncRNAs, as well as for lncRNAs themselves that harbor
regulatory motifs (16, 17). lncRNAs could be regulated
differently from protein-coding genes (18). Recent evidence
also suggests a role for lncRNAs in several diseases,
including cancers such as lung cancer, colorectal cancer and
blood neoplasia (7, 19). Candidate lncRNAs like NEAT2 and
MALAT1 have been studied in detail for their relation to
metastasis in cancers (20-22). Additional candidates like
ANRIL have been implicated in diseases like atherosclerosis
(23, 24), while a number of candidate genome-wide association
loci map to regions presently annotated as lncRNA genes (25).
It has also been suggested that a conceptual understanding of
lncRNAs in terms of their biological interactions would help
to understand disease processes and develop potential drug
targets (26).

There are several comprehensive databases for other ncRNAs
such as miRNAs (27-30) and snoRNAs (31); however, there is a
paucity of such databases integrating biologically
significant annotations for lncRNAs. Although lncRNA
databases such as lncRNAdb (32) and NONCODE (33) are
emerging, the extent of lncRNA annotation still remains
limited. lncRNome has been formulated to integrate
annotations on a wide variety of biologically significant
information into a comprehensive knowledgebase. To the best
of our knowledge, lncRNome is one of the largest catalogs of
lncRNAs to date, and is available online at
http://genome.igib.res.in/lncRNome.

Database design and architecture 

The lncRNome database has been designed keeping in mind both
experimental and computational biologists, so as to provide
ready access to biologically relevant data as per the needs
of a user. To this end, the structure was designed following
consultation with a number of experimental and computational
biologists. We created the database to serve as a
comprehensive, user-friendly and biologically relevant
knowledgebase on human lncRNAs, built on MySQL 5.6 with a
PHP-based web interface. In brief, each lncRNA gene has a
single page with basic linkouts to other relevant databases,
annotation sets and relevant categories of information linked
in tabs. Five categories of information are presently linked
with each lncRNA: (i) General Information, (ii) Sequence and
Structure, (iii) Interactions and Processing, (iv) Variations
and Conservation and (v) Epigenetic Modifications. These
categories are connected to the genome browser along with the
conservation scores of all lncRNA transcripts (Supplementary
File S1).

The category 'General Information' hosts information such as
the gene name, Ensembl gene ID, gene type, gene status,
Ensembl transcript ID, transcript name, transcript type,
transcript status, chromosome, strand and genomic loci, all
of which were fetched from Gencode release 12
(http://www.gencodegenes.org) (34). The gene names were used
to map the HGNC ID, RefSeq ID, Havana gene ID, Havana
transcript ID, NCBI ID and chromosomal loci from the HUGO
Gene Nomenclature Committee website (35). The length was
calculated from the genomic loci. The lncRNA descriptions,
disease associations, interactions, overexpression and
references were manually curated from the literature. The
alternate transcripts were derived using in-house scripts,
and all lncRNAs were assigned stable internal IDs.

The lncRNA sequences were downloaded from the UCSC Genome
Browser Database (36), and the structures were predicted
using RNAfold version 1.8.5. Both the parenthesis structure
and the minimum free energy structure predicted using the
default parameters have been provided.
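RNAfold reports predicted structures in dot-bracket (parenthesis) notation, in which matched parentheses mark paired bases and dots mark unpaired bases. As a minimal sketch of how such a string could be read into base pairs, the following Python function may help; the function name and the example string are illustrative assumptions and are not part of lncRNome.

```python
def dot_bracket_pairs(structure):
    """Return sorted (i, j) base pairs (0-based) from a dot-bracket string."""
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Hypothetical hairpin: a 3-bp stem closing a 4-nt loop.
example = "(((....)))"
pairs = dot_bracket_pairs(example)
```

The resulting pair list can be used, for instance, to check which nucleotides of a transcript are predicted to be single-stranded.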

The third category comprises lncRNA interactions with
proteins and other RNAs, lncRNA processing, predicted open
reading frames (ORFs) and various motifs. The database hosts
937 quadruplex and 40 hairpin motifs present in lncRNAs,
predicted using the in-house tools Quadfinder (37) and
HairpinFetcher, respectively. It also hosts 3716 miRNA
binding sites on lncRNAs. More than 10000 binding sites for
nine other proteins, which are summarized in the section
'Datasets and features', have been provided. These binding
sites were mapped using PAR-CLIP (38) and CLIP-Seq datasets
as described in the later sections. The database also
provides 6808 predicted protein-binding sites, which were
predicted using Support Vector Machine-based evaluation of
interaction propensities, and 1692 small RNA processing
sites, as described in the sections below. The fourth
category consists of 345 351 genomic variations mapped to
lncRNAs. The Single Nucleotide Polymorphism database (dbSNP)
SNPs were downloaded from the UCSC Genome Browser and mapped
to lncRNAs. Conservation scores of 66 573 sites within
lncRNAs have been provided in this category. The fifth
category provides 11 790 epigenetic marks in the promoters
of lncRNAs. The datasets were downloaded from the NIH Human
Epigenome Roadmap project and mapped to lncRNA promoters.
The detailed methods are available as Supplementary methods.

The database also features a comprehensive search option,
which enables users to search lncRNome using different
keywords, such as lncRNA names, Ensembl IDs, known targets,
SNPs and diseases. In addition, a separate browse option
allows users to browse the database by chromosome number or
by lncRNA biotype. The database also features a genome
browser, which can be used to browse the genome for
representative features and provides a visual representation
of the associated genomic annotations available within the
database, along with the conservation scores of lncRNAs.

Datasets and features 

Long noncoding RNA annotations 

lncRNA annotations were derived from Gencode release 12
(http://www.gencodegenes.org) (34), which consists of
11 790 lncRNA genes and 18855 transcripts. The lncRNA
transcripts are classified into 10 different biotypes, the
statistics of which are provided in Figure 1. In addition,
the datasets of lncRNAs and their HGNC IDs were derived from
the HUGO Gene Nomenclature Committee website (35), which
consisted of 1073 lncRNAs. Additional mappings were derived
for 99 human lncRNAs from lncRNAdb and from the literature
through manual curation, and the sets were overlapped with
each other based on genomic coordinates (Figure 2). A stable
internal ID is also provided for easy access and to enable
cross-referencing between the different IDs regularly used
by different sequence databases. The consensus ID forms the
primary reference key within lncRNome and has also been used
to reference alternate transcript isoforms. Wherever
appropriate, lncRNAs have also been linked back to relevant
databases such as Ensembl, HGNC and NCBI for quick
cross-reference.

Manual annotation of the functionally characterized lncRNAs
is provided, which includes information about disease
associations, expression and functional significance. The
annotations were collected through literature surveys and
manual curation.

Sequence, structure and motifs

The lncRNA sequences were downloaded from the UCSC Genome
Browser using the genomic locations of individual
transcripts (36). RNA structures were computed with default
parameters using RNAfold, which is part of the Vienna RNA
package version 1.8.5. Our group has previously suggested
the presence of G-quadruplex motifs in lncRNAs that could
have potential regulatory functions (39). To enable
researchers to take up further experiments in this area,
potential G-quadruplex-forming motifs predicted across
entire lncRNA transcripts using Quadfinder (37) have been
included, together with potential hairpin structures
identified using HairpinFetcher.
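To illustrate what a sequence-based G-quadruplex scan looks like, the sketch below searches an RNA sequence with a regular expression. The search parameters used by Quadfinder are not stated in this section, so the pattern here (four runs of at least three guanines separated by loops of 1-7 nucleotides) is a widely used consensus adopted as an assumption, and the function name and example sequence are hypothetical.

```python
import re

# Assumed consensus motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+.
# This is NOT taken from Quadfinder; it only illustrates the idea.
G4_PATTERN = re.compile(
    r"G{3,}[ACGU]{1,7}?G{3,}[ACGU]{1,7}?G{3,}[ACGU]{1,7}?G{3,}"
)


def find_g4_motifs(sequence):
    """Return (start, end, motif) for non-overlapping matches, 0-based half-open."""
    rna = sequence.upper().replace("T", "U")  # accept DNA or RNA input
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(rna)]


# Hypothetical sequence containing one candidate motif.
example = "AUCGGGAGGGUUGGGCAGGGAUC"
hits = find_g4_motifs(example)
```

Because `finditer` reports non-overlapping matches only, a scan meant to enumerate every candidate motif would need a lookahead-based pattern instead.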

lncRNA processing

A recent study conducted by our lab pointed to a subset of
lncRNAs that could potentially be processed into small RNAs
with downstream regulatory functions, giving these loci a
dual transcriptional output (40). The same analysis was
replicated on the present, larger dataset of lncRNAs. In
brief, small RNA clusters were derived from DeepBase (41), a
comprehensive database of small RNA annotations derived from
publicly available small RNA sequencing experiments, and
overlaid on the lncRNA annotations to identify lncRNAs that
could be processed into small RNAs.
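Overlaying small RNA clusters on lncRNA annotations amounts to an interval-overlap test. A minimal Python sketch is given here; the record layout, the identifiers, the half-open coordinate convention and the requirement that both features lie on the same strand are assumptions made for illustration and are not taken from the lncRNome pipeline.

```python
from collections import defaultdict


def overlay_clusters(lncrnas, clusters):
    """Map each lncRNA ID to the IDs of small RNA clusters overlapping it.

    Coordinates are treated as 0-based, half-open [start, end).
    Only clusters on the same chromosome and strand are considered.
    """
    by_location = defaultdict(list)
    for cluster in clusters:
        by_location[(cluster["chrom"], cluster["strand"])].append(cluster)

    hits = {}
    for lnc in lncrnas:
        candidates = by_location.get((lnc["chrom"], lnc["strand"]), [])
        overlapping = [
            c["id"]
            for c in candidates
            if c["start"] < lnc["end"] and lnc["start"] < c["end"]
        ]
        if overlapping:
            hits[lnc["id"]] = overlapping
    return hits


# Hypothetical records; none of these identifiers exist in lncRNome.
lncrnas = [
    {"id": "lnc1", "chrom": "chr1", "strand": "+", "start": 100, "end": 500},
    {"id": "lnc2", "chrom": "chr1", "strand": "-", "start": 100, "end": 500},
]
clusters = [
    {"id": "c1", "chrom": "chr1", "strand": "+", "start": 450, "end": 480},
    {"id": "c2", "chrom": "chr1", "strand": "+", "start": 500, "end": 520},
    {"id": "c3", "chrom": "chr2", "strand": "+", "start": 100, "end": 200},
]
processed = overlay_clusters(lncrnas, clusters)
```

For genome-scale inputs, the inner scan would normally be replaced by sorted intervals or an interval tree, but the overlap condition stays the same.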

Protein-RNA interactions 

Recent high-throughput experimental methods for the analysis
of interactions through pull-down and sequencing techniques
have provided critical insights into the landscape of
protein-RNA interactions in the human genome (42). One of
the major datasets of protein-RNA interactions is derived
from PAR-CLIP (38) experiments for Argonaute (Ago) proteins,
which are critical components of the RISC machinery involved
in miRNA targeting. A comprehensive mapping of potential Ago
binding sites in the lncRNA transcriptome is provided by
mapping the reads to the human transcriptome. Experimental
datasets also exist for other proteins including IGF2BP1,
IGF2BP2, IGF2BP3, PTB, PUM2, QKI, TNRC6A, TNRC6B and TNRC6C,
which have also been mapped to the lncRNA transcripts.
Because experimental datasets for protein-RNA interactions
are scarce, we also incorporated a computational prediction
method that uses a Support Vector Machine to predict
residues in RNA that are likely to interact with proteins
(Panwar and Raghava 2012, unpublished results).

Genomic variations and conservation 

Genome-wide association studies in the recent past have
suggested disease associations that could be modulated by
lncRNAs (43). In addition, a number of genomic loci
previously shown to be associated with diseases have now
been shown to fall within lncRNA gene loci. To facilitate
further in-depth analysis and experimental validation of the
effect of variations on lncRNAs, we have included a
comprehensive mapping of genomic variations in lncRNA loci.
In brief, the variations corresponding to dbSNP 135 were
downloaded (44) and mapped to the respective genomic
locations of lncRNAs. In addition, disease-associated
variations were derived from the NIH Catalog of published
genome-wide association studies and mapped to their
respective rsIDs. The PhastCons conservation scores were
downloaded from UCSC and the genomic loci were mapped to
lncRNAs (45).

Figure 1. Distribution of Gencode release 12 lncRNAs according to different biotypes.
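Mapping variations to lncRNA loci is a point-in-interval query. One possible way to do this, sketched here in Python with a binary search over sorted SNP positions, is shown below; the tuple layouts, the half-open coordinate convention and all identifiers are hypothetical and are not taken from the lncRNome methods.

```python
from bisect import bisect_left
from collections import defaultdict


def map_snps_to_lncrnas(snps, lncrnas):
    """Return {lncRNA ID: [SNP IDs]} for SNPs inside each locus.

    snps:    iterable of (snp_id, chrom, pos), pos 0-based
    lncrnas: iterable of (lnc_id, chrom, start, end), half-open [start, end)
    """
    positions = defaultdict(list)
    for snp_id, chrom, pos in snps:
        positions[chrom].append((pos, snp_id))
    for chrom in positions:
        positions[chrom].sort()

    mapping = {}
    for lnc_id, chrom, start, end in lncrnas:
        chrom_snps = positions.get(chrom, [])
        # (start, "") sorts before any (start, snp_id), so `lo` is the
        # first SNP with pos >= start; `hi` is the first with pos >= end.
        lo = bisect_left(chrom_snps, (start, ""))
        hi = bisect_left(chrom_snps, (end, ""))
        mapping[lnc_id] = [snp_id for _, snp_id in chrom_snps[lo:hi]]
    return mapping


# Made-up identifiers and coordinates, for illustration only.
snps = [
    ("snp_a", "chr1", 150),
    ("snp_b", "chr1", 500),
    ("snp_c", "chr2", 150),
    ("snp_d", "chr1", 100),
]
lncrnas = [("lncA", "chr1", 100, 500), ("lncB", "chr3", 1, 10)]
snp_map = map_snps_to_lncrnas(snps, lncrnas)
```

Sorting once per chromosome keeps each locus query logarithmic in the number of SNPs, which matters when hundreds of thousands of variations are mapped.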

Epigenetic modifications 

A recent report from our group suggests that the promoters
of lncRNAs could be regulated by mechanisms that are
distinct from those of protein-coding genes, supporting the
role of lncRNAs in epigenetic regulation of genes. To
capture the epigenetic marks, in terms of DNA methylation
and histone marks, we have provided comprehensive access to
epigenetic marks in the promoters of lncRNAs. Briefly, the
raw datasets were downloaded from the NIH Human Epigenome
Roadmap project and mapped and analyzed as described in
Sati et al. (46). The epigenetic marks are also available
for browsing through the genome browser. The datasets and
genomic mappings are compiled in Table 1.

Predicted peptides 

Open reading frames were predicted for all lncRNAs using the
Sixpack tool (http://www.ebi.ac.uk/Tools/st/emboss_sixpack/)
from EMBOSS. The tool translates the given sequence in six
frames and reports peptides starting with methionine and
with length >10 amino acids.
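The six-frame translation step can be illustrated with a short, self-contained Python sketch. It is a simplified stand-in written for this purpose and does not reproduce Sixpack itself: the function names are hypothetical, the standard genetic code is assumed, and a peptide is taken to run from the first methionine of a stop-free segment to the end of that segment.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_peptides(sequence, min_length=11):
    """Return (strand, frame, peptide) for peptides starting with M.

    min_length=11 encodes the 'length >10 amino acids' criterion.
    """
    dna = sequence.upper().replace("U", "T")
    peptides = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            for segment in protein.split("*"):  # stop codons delimit segments
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_length:
                    peptides.append((strand, offset + 1, segment[start:]))
    return peptides


# Hypothetical input: ATG, eleven AAA codons, then a stop codon.
example = "ATG" + "AAA" * 11 + "TAA"
found = six_frame_peptides(example)
```

Applied to a transcript sequence, the function returns candidate peptides together with the strand and frame in which each was found.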



Conclusions and future perspectives

lncRNome is designed to primarily serve as an evidence-based
resource on lncRNAs and their functionality in humans. To
this end, we have provided stable reference IDs for lncRNA
genes and alternate transcript isoforms of a gene, with
cross-references to other sequence and annotation databases
to ensure interoperability and stable referencing. The
knowledgebase integrates biologically oriented datasets and
resources on lncRNAs, and manual annotations wherever
applicable, with the aim of providing a one-stop solution
for annotation information on lncRNAs.

The interface allows easy access to various features of
lncRNAs, organized within five categories and their
sublevels (Supplementary Figure S1). The category 'General'
provides all the basic annotations of each lncRNA, including
genomic loci, associated diseases and various linkouts. The
sequences and predicted structures of the lncRNAs are
provided in the category 'Sequence and Structure'. lncRNA
structures are poorly understood, and characterizing them is
indispensable for elucidating structure-function
relationships, because specific lncRNA structures are
essential for binding to proteins, RNA and other
biomolecules. lncRNome provides information on various
hairpin and quadruplex motifs in lncRNAs, motifs found to be
essential for the regulation of many biological processes.
Both experimental and predicted datasets on RNA-protein
interactions have been provided for lncRNAs, revealing
various protein and RNA interacting partners of lncRNAs
(Supplementary Figure S1). Although the exact mechanism by
which lncRNAs interact with different partners is still not
known, our data provide a starting point for the community
to understand the various regulatory interactions of lncRNAs
with their respective partners. Genomic variations in
lncRNAs have been studied to understand the effect of SNPs
on the biogenesis and functions of lncRNAs. The
disease-associated SNPs present in lncRNAs might provide
information about genotype-to-phenotype associations. The
distribution of epigenetic marks like DNA methylation and
histone modifications across the transcription start sites
(TSS) of lncRNAs might help in evaluating the effect of
chromatin modifications on gene expression (Supplementary
Figure S1).

Table 1. Total fields in the database along with the genomic loci mapped

Serial No.  Database fields                             Total genomic loci mapped
1           Total lncRNAs                               18855
2           Hairpins                                    40
3           Methylation and histone modifications       11 790
4           miRNA binding sites                         3716
5           Quadruplexes                                937
6           Predicted protein-binding sites on lncRNA   6808
7           Small RNA clusters
8           Single nucleotide polymorphisms             295851

Because the field is emerging and many more lncRNAs are
being discovered and annotated, thanks to the availability
of a large number of transcriptome sequencing datasets in
the public domain, lncRNome in its present form has many
gaps. The primary gap is the paucity of information on the
expression of lncRNAs in different tissues. With genome-wide
transcriptome annotations of many tissues now available in
the public domain, we plan to enrich the database with this
information. We intend to collaborate with other
international consortia to enable seamless cross-linking and
sharing of resources. In future, we envisage the database
being available as a community-curated and semantically
linked interoperable data resource.

Supplementary Data 

Supplementary data are available at Database Online. 